Superstore Dataset Analysis - NTI Project
----------------------------------------------

(1) Packages

pandas: For data manipulation and analysis (DataFrames, filtering, etc.)

NumPy: For numerical operations on the Sales, Profit, Quantity, and Discount columns; used to calculate totals, averages, and percentages, and for array-based computations on the Superstore dataset

In [1]:
import pandas as pd
import numpy as np

matplotlib: Base plotting library for creating static, animated, and interactive visualizations

seaborn: Built on matplotlib; provides a high-level interface for statistical graphics

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns

rich: Provides enhanced console output with colors and formatting (optional)

In [7]:
from rich import print

Suppress warnings to keep the notebook output clean

In [10]:
import warnings
warnings.filterwarnings("ignore")

______________________________________________

(2) Data Preparation

In [14]:
# Load Excel file
df = pd.read_excel("Cleaned Sample - NTI Project (Python) - Superstore excel.xlsx")

pd.set_option('display.max_columns',None)

df.head(10)
Out[14]:
Order ID Order Date Ship Date Difference (Days) Ship Mode Customer ID Customer Name Segment City State Postal Code Region Product ID Category Sub-Category Product Name Price Per Unit Quantity Discounted? Discount Sales Profit Cost
0 CA-2014-103800 2014-01-03 2014-01-07 4 Standard Class DP-13000 Darren Powers Consumer Seattle Texas 77095 Central OFF-PA-10000174 Office Supplies Paper Message Book, Wirebound, Four 5 1/2" X 4" Form... 10.281250 2 YES 0.2 16.45 5.55 20.5625
1 CA-2014-112326 2014-01-04 2014-01-08 4 Standard Class PO-19195 Phillina Ober Home Office San Diego Illinois 60540 Central OFF-LA-10003223 Office Supplies Labels Avery 508 4.908333 3 YES 0.2 11.78 4.27 14.7250
2 CA-2014-112326 2014-01-04 2014-01-08 4 Standard Class PO-19195 Phillina Ober Home Office San Diego Illinois 60540 Central OFF-ST-10002743 Office Supplies Storage SAFCO Boltless Steel Shelving 113.641667 3 YES 0.2 272.74 -64.77 340.9250
3 CA-2014-112326 2014-01-04 2014-01-08 4 Standard Class PO-19195 Phillina Ober Home Office Evanston Illinois 60540 Central OFF-BI-10004094 Office Supplies Binders GBC Standard Plastic Binding Systems Combs 8.850000 2 YES 0.8 3.54 -5.49 17.7000
4 CA-2014-141817 2014-01-05 2014-01-12 7 Standard Class MB-18085 Mick Brown Consumer Baltimore Pennsylvania 19143 East OFF-AR-10003478 Office Supplies Art Avery Hi-Liter EverBold Pen Style Fluorescent ... 8.141667 3 YES 0.2 19.54 4.88 24.4250
5 CA-2014-130813 2014-01-06 2014-01-08 2 Second Class LS-17230 Lycoris Saunders Consumer Akron California 90049 West OFF-PA-10002005 Office Supplies Paper Xerox 225 6.480000 3 NO 0.0 19.44 9.33 19.4400
6 CA-2014-106054 2014-01-06 2014-01-07 1 First Class JO-15145 Jack O'Briant Corporate San Jose Georgia 30605 South OFF-AR-10002399 Office Supplies Art Dixon Prang Watercolor Pencils, 10-Color Set w... 4.260000 3 NO 0.0 12.78 5.24 12.7800
7 CA-2014-167199 2014-01-06 2014-01-10 4 Standard Class ME-17320 Maria Etezadi Home Office Greenville Kentucky 42420 South FUR-CH-10004063 Furniture Chairs Global Deluxe High-Back Manager's Chair 285.980000 9 NO 0.0 2573.82 746.41 2573.8200
8 CA-2014-167199 2014-01-06 2014-01-10 4 Standard Class ME-17320 Maria Etezadi Home Office Los Angeles Kentucky 42420 South OFF-BI-10004632 Office Supplies Binders Ibico Hi-Tech Manual Binding System 304.990000 2 NO 0.0 609.98 274.49 609.9800
9 CA-2014-167199 2014-01-06 2014-01-10 4 Standard Class ME-17320 Maria Etezadi Home Office Houston Kentucky 42420 South OFF-AR-10001662 Office Supplies Art Rogers Handheld Barrel Pencil Sharpener 2.740000 2 NO 0.0 5.48 1.48 5.4800
In [16]:
# Basic info: column types, non-null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Order ID           9994 non-null   object        
 1   Order Date         9994 non-null   datetime64[ns]
 2   Ship Date          9994 non-null   datetime64[ns]
 3   Difference (Days)  9994 non-null   int64         
 4   Ship Mode          9994 non-null   object        
 5   Customer ID        9994 non-null   object        
 6   Customer Name      9994 non-null   object        
 7   Segment            9994 non-null   object        
 8   City               9994 non-null   object        
 9   State              9994 non-null   object        
 10  Postal Code        9994 non-null   int64         
 11  Region             9994 non-null   object        
 12  Product ID         9994 non-null   object        
 13  Category           9994 non-null   object        
 14  Sub-Category       9994 non-null   object        
 15  Product Name       9994 non-null   object        
 16  Price Per Unit     9994 non-null   float64       
 17  Quantity           9994 non-null   int64         
 18  Discounted?        9994 non-null   object        
 19  Discount           9994 non-null   float64       
 20  Sales              9994 non-null   float64       
 21  Profit             9994 non-null   float64       
 22  Cost               9994 non-null   float64       
dtypes: datetime64[ns](2), float64(5), int64(3), object(13)
memory usage: 1.8+ MB
In [18]:
from ydata_profiling import ProfileReport
report = ProfileReport(df)

In [20]:
report
Out[20]:

______________________________________________

(3) Missing Values

In [79]:
# Check whether any value in the dataset is missing
df.isnull().values.any()
Out[79]:
False
In [65]:
print("\nMissing Values (True = Missing):")
print(df.isnull().head())
Missing Values (True = Missing):
   Order ID  Order Date  Ship Date  Difference (Days)  Ship Mode  Customer ID  \
0     False       False      False              False      False        False   
1     False       False      False              False      False        False   
2     False       False      False              False      False        False   
3     False       False      False              False      False        False   
4     False       False      False              False      False        False   

   Customer Name  Segment   City  Remove D  State  Postal Code  Region  \
0          False    False  False     False  False        False   False   
1          False    False  False     False  False        False   False   
2          False    False  False     False  False        False   False   
3          False    False  False     False  False        False   False   
4          False    False  False     False  False        False   False   

   Product ID  Category  Sub-Category  Product Name  Price Per Unit  Quantity  \
0       False     False         False         False           False     False   
1       False     False         False         False           False     False   
2       False     False         False         False           False     False   
3       False     False         False         False           False     False   
4       False     False         False         False           False     False   

   Discounted?  Discount  Sales  Profit   Cost  City (UNIQUE)  \
0        False     False  False   False  False          False   
1        False     False  False   False  False          False   
2        False     False  False   False  False          False   
3        False     False  False   False  False          False   
4        False     False  False   False  False          False   

   Count (COUNTIF)  Total Sales per City  
0            False                 False  
1            False                 False  
2            False                 False  
3            False                 False  
4            False                 False  
In [153]:
print("[bold #FF0000]Total missing values per column[/bold #FF0000]")
print(f"[bold]{df.isnull().sum()}[/bold]")
Total missing values per column
Order ID             0
Order Date           0
Ship Date            0
Difference (Days)    0
Ship Mode            0
Customer ID          0
Customer Name        0
Segment              0
City                 0
State                0
Postal Code          0
Region               0
Product ID           0
Category             0
Sub-Category         0
Product Name         0
Price Per Unit       0
Quantity             0
Discounted?          0
Discount             0
Sales                0
Profit               0
Cost                 0
dtype: int64
In [155]:
print("\n[bold blue]Total Missing Values in the Dataset:[/bold blue]")
print(f"[bold #FF0000]{df.isnull().sum().sum()} [/bold #FF0000]")
Total Missing Values in the Dataset:
0 

______________________________________________

Duplicates

In [137]:
print(f"[bold] {df.duplicated().any()}[/bold]")
 False
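The duplicate check above can be illustrated on a tiny, hypothetical frame (the values below are made up for demonstration; the real notebook runs the same calls on `df`):

```python
import pandas as pd

# Hypothetical mini-frame to show the duplicate check on a known case
toy = pd.DataFrame({"Order ID": ["A1", "A1", "A2"], "Sales": [10.0, 10.0, 5.0]})

dup_count = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates()          # keeps the first occurrence of each row
```

If the Superstore check had returned True, `df.drop_duplicates()` would be the natural next step.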

______________________________________________

(4) Descriptive Analysis of Profit

Overall Profit summary (count, mean, std, min, quartiles, max).

In [94]:
print(df['Profit'].describe())
count    9994.000000
mean       28.656973
std       234.260203
min     -6599.980000
25%         1.730000
50%         8.665000
75%        29.360000
max      8399.980000
Name: Profit, dtype: float64

Total Profit

In [102]:
print(df['Profit'].sum())
286397.79
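Total profit is more meaningful next to total sales. A minimal sketch of the overall profit margin, using a hypothetical mini-frame with the same column names as `df`:

```python
import pandas as pd

# Hypothetical values; on the real data this is df['Profit'].sum() / df['Sales'].sum()
mini = pd.DataFrame({"Sales": [100.0, 50.0, 50.0], "Profit": [20.0, 5.0, -5.0]})

total_profit = mini["Profit"].sum()
overall_margin = total_profit / mini["Sales"].sum()  # profit as a share of sales
```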

______________________________________________

(5) Profit by Category and Sub-Category

Which categories (Furniture, Technology, Office Supplies) generate the most profit?

In [108]:
print(df.groupby('Category')['Profit'].sum().sort_values(ascending=False))
Category
Technology         145455.66
Office Supplies    122490.88
Furniture           18451.25
Name: Profit, dtype: float64

Which Sub-Categories have negative or low profit?

In [113]:
print(df.groupby('Sub-Category')['Profit'].sum().sort_values(ascending=False))
Sub-Category
Copiers        55617.90
Phones         44516.25
Accessories    41936.78
Paper          34053.34
Binders        30221.64
Chairs         26590.15
Storage        21279.05
Appliances     18138.07
Furnishings    13059.25
Envelopes       6964.10
Art             6527.96
Labels          5546.18
Machines        3384.73
Fasteners        949.53
Supplies       -1188.99
Bookcases      -3472.56
Tables        -17725.59
Name: Profit, dtype: float64

______________________________________________

(6) Profit by Region / State / City

Geographic breakdown of profit (top and bottom states and cities)

In [121]:
print(df.groupby('Region')['Profit'].sum().sort_values(ascending=False))
Region
West       108418.79
East        91522.84
South       46749.71
Central     39706.45
Name: Profit, dtype: float64

Identify loss-making locations

In [125]:
print(df.groupby('State')['Profit'].sum().nlargest(10))
State
California    76381.60
New York      74038.64
Washington    33402.70
Michigan      24463.15
Virginia      18598.00
Indiana       18382.97
Georgia       16250.08
Kentucky      11199.70
Minnesota     10823.22
Delaware       9977.37
Name: Profit, dtype: float64
In [130]:
print(df.groupby('State')['Profit'].sum().nsmallest(10))
State
Texas            -25729.29
Ohio             -16971.37
Pennsylvania     -15560.04
Illinois         -12607.89
North Carolina    -7490.81
Colorado          -6527.86
Tennessee         -5341.66
Arizona           -3427.87
Florida           -3399.25
Oregon            -1190.48
Name: Profit, dtype: float64
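The section title also mentions cities; the same `groupby` + `nsmallest` pattern drops down a level. A sketch on hypothetical rows (the real call would be `df.groupby('City')['Profit'].sum().nsmallest(10)`):

```python
import pandas as pd

# Hypothetical city-level rows; the real df has the same 'City' and 'Profit' columns
mini = pd.DataFrame({
    "City":   ["Houston", "Houston", "Seattle"],
    "Profit": [-5.0, -3.0, 12.0],
})

city_profit = mini.groupby("City")["Profit"].sum()
worst_cities = city_profit.nsmallest(1)  # most loss-making city first
```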

______________________________________________

(7) Visualization

A) Profit vs. Sales
Relationship between Sales and Profit (high sales can still mean low profit when discounts are heavy)

In [159]:
plt.scatter(df['Sales'], df['Profit'])
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Sales vs Profit')
plt.show()
[Figure: scatter plot of Sales vs Profit]
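The scatter plot can be backed by numbers: a correlation matrix over Sales, Profit, and Discount quantifies how discounting relates to profit. A minimal sketch with hypothetical values (on the real data: `df[['Sales','Profit','Discount']].corr()`):

```python
import pandas as pd

# Hypothetical values chosen so a deep discount coincides with a loss
mini = pd.DataFrame({
    "Sales":    [10.0, 20.0, 30.0, 40.0],
    "Profit":   [2.0, 4.0, -1.0, 5.0],
    "Discount": [0.0, 0.0, 0.8, 0.2],
})

corr = mini[["Sales", "Profit", "Discount"]].corr()  # pairwise Pearson correlations
```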

B) Profit by Category

In [161]:
plt.figure(figsize=(10, 6))
category_profit = df.groupby('Category')['Profit'].sum().sort_values(ascending=False)
ax = sns.barplot(x=category_profit.index, y=category_profit.values, palette='viridis')

plt.title('Profit by Category', fontsize=18, fontweight='bold', style='italic', color="#08615A")
plt.xlabel('Category', fontsize=15, fontweight='bold', color="#239406")
plt.ylabel('Profit', fontsize=15, fontweight='bold', color="#360078")
plt.xticks(rotation=20, ha='center')
plt.tight_layout()

palette = sns.color_palette("viridis", n_colors=len(category_profit))
for i, ticklabel in enumerate(ax.get_xticklabels()):
    ticklabel.set_color(palette[i])
    ticklabel.set_fontweight('bold')
    ticklabel.set_fontstyle('italic')

plt.show()
[Figure: bar chart of total Profit by Category]

C) Profit by Region

In [194]:
plt.figure(figsize=(10, 6))
region_profit = df.groupby('Region')['Profit'].sum().sort_values(ascending=False)
ax = sns.barplot(x=region_profit.index, y=region_profit.values, palette='coolwarm')

plt.title('Profit by Region', fontsize=18, fontweight='bold', style='italic', color="#590057")
plt.xlabel('Region', fontsize=15, fontweight='bold', color="#084298")
plt.ylabel('Profit', fontsize=15, fontweight='bold', color="#198754")
plt.tight_layout()

plt.show()
[Figure: bar chart of total Profit by Region]

D) Profit by Sub-Category

In [178]:
import plotly.express as px

sub_profit = df.groupby('Sub-Category')['Profit'].sum().reset_index().sort_values('Profit', ascending=False)

fig = px.bar(
    sub_profit,
    x='Sub-Category',
    y='Profit',
    text='Profit',
    title="Profit by Sub-Category",
    color='Profit',
    color_continuous_scale='Blues'
)

fig.update_traces(
    texttemplate='%{text:,.0f}',
    textposition='outside',
    hovertemplate='<b>%{x}</b><br>Profit: %{y:,.0f}'
)


fig.update_layout(
    title="Profit by Sub-Category",
    width=900,
    height=500,
    title_x=0.3,
    title_font=dict(size=20, color='#070047',style='italic'),
    xaxis_tickangle=-45
)

fig.show()

E) Profit vs Discount

In [220]:
fig = px.scatter(
    df,
    x="Discount",
    y="Profit",
    color="Category",
    size="Sales",
    title="<b><i>Discount Impact on Profit</i></b>",
    hover_data=['Sub-Category','Product Name']
)

fig.update_traces(marker=dict(opacity=0.7, line=dict(width=1, color='DarkSlateGrey')))
fig.update_layout(title_x=0.3, title_font=dict(size=20, color='#004085',style='italic'))

fig.show()

F) Profit by Segment

In [308]:
segment_profit = df.groupby('Segment')['Profit'].sum()

fig = px.pie(
    names=segment_profit.index,
    values=segment_profit.values,
    title="<b>Profit Distribution by Customer Segment</b>",
    hole=0.3,
    color_discrete_sequence=px.colors.qualitative.Dark24_r
)

fig.update_traces(
    textposition='inside',
    textinfo='percent+label',
    hovertemplate='Segment:<b> %{label}</b><br>Profit: <b>%{value:,.0f}</b>',
    insidetextfont=dict(size=14)
)


fig.update_layout(
    title=dict(
        x=0.5,  
        font=dict(size=20)
    ),
    legend=dict(
        orientation="h",
        y=-0.1,
        x=0.35
    )
)

fig.show()

G) Profit Trend Over Time

In [326]:
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['YearMonth'] = df['Order Date'].dt.to_period('M').astype(str)

profit_trend = df.groupby('YearMonth')['Profit'].sum().reset_index()

fig = px.line(
    profit_trend,
    x='YearMonth',
    y='Profit',
    title="Profit Trend Over Time",
    markers=True
)

fig.update_traces(line=dict(color='#0077b6', width=3))

fig.update_layout(
    title=dict(
        x=0.25,  
        font=dict(size=20, color='#003049')
    )
)

fig.show()

______________________________________________

(8) Calculate RFM Metrics

Define Recency, Frequency, and Monetary value for each customer

In [22]:
latest_date = df['Order Date'].max() + pd.Timedelta(days=1)

rfm = df.groupby('Customer ID').agg({
    'Order Date': lambda x: (latest_date - x.max()).days,  # Recency
    'Order ID': 'nunique',                                # Frequency
    'Sales': 'sum'                                        # Monetary
})

rfm.columns = ['Recency', 'Frequency', 'Monetary']

rfm.head()
Out[22]:
Recency Frequency Monetary
Customer ID
AA-10315 185 5 5563.56
AA-10375 20 9 1056.39
AA-10480 260 4 1790.51
AA-10645 56 6 5086.94
AB-10015 416 3 886.15

🤠 Loyal customers (low Recency, high Frequency, high Monetary)
😮‍💨 Churned customers (high Recency, low Frequency, low Monetary)
🤑 Big spenders (high Monetary --> VIP; low Monetary --> low engagement)

In [30]:
rfm_classified = rfm.copy()

def classify_customer(row):
    labels = []
    
    if row['Recency'] < 30 and row['Frequency'] > 5 and row['Monetary'] > 2000:
        labels.append("Loyal")
    
    if row['Recency'] > 100 and row['Frequency'] <= 2 and row['Monetary'] < 5000:
        labels.append("Churned")
    
    if row['Monetary'] > 10000:
        labels.append("Big Spenders")

    if row['Recency'] < 60:
        labels.append("Active")
    
    if row['Frequency'] > 3:
        labels.append("Regular Buyer")
    
    if row['Monetary'] > 1000:
        labels.append("Good Spender")

    return " & ".join(labels) if labels else "Unclassified"

rfm_classified["Classification"] = rfm_classified.apply(classify_customer, axis=1)

rfm_classified.head(10)
Out[30]:
Recency Frequency Monetary Classification
Customer ID
AA-10315 185 5 5563.56 Regular Buyer & Good Spender
AA-10375 20 9 1056.39 Active & Regular Buyer & Good Spender
AA-10480 260 4 1790.51 Regular Buyer & Good Spender
AA-10645 56 6 5086.94 Active & Regular Buyer & Good Spender
AB-10015 416 3 886.15 Unclassified
AB-10060 55 8 7755.63 Active & Regular Buyer & Good Spender
AB-10105 42 10 14473.57 Big Spenders & Active & Regular Buyer & Good S...
AB-10150 42 5 966.71 Active & Regular Buyer
AB-10165 26 8 1113.85 Active & Regular Buyer & Good Spender
AB-10255 167 9 914.53 Regular Buyer

______________________________________________

(9) Assign R, F, M Scores
Use quantiles to give each metric a score from 1 (lowest) to 4 (highest)

In [337]:
# Quantile-based R, F, M scores (1 = lowest, 4 = highest); rank() breaks ties in Frequency
rfm['R_Score'] = pd.qcut(rfm['Recency'], 4, labels=[4,3,2,1])  # lower Recency = better = higher score
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 4, labels=[1,2,3,4])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 4, labels=[1,2,3,4])

# Concatenate the three digits into a single RFM pattern (e.g. '441'), as shown in the output
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)


rfm.head()
Out[337]:
Recency Frequency Monetary R_Score F_Score M_Score RFM_Score
Customer ID
AA-10315 185 5 5563.56 1 1 4 114
AA-10375 20 9 1056.39 4 4 1 441
AA-10480 260 4 1790.51 1 1 2 112
AA-10645 56 6 5086.94 3 2 4 324
AB-10015 416 3 886.15 1 1 1 111
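The `pd.qcut` mechanics can be seen on a toy series: it splits values into equal-sized quantile bins and attaches the given labels in order. A sketch with hypothetical Monetary values:

```python
import pandas as pd

# Eight hypothetical Monetary values; qcut makes 4 equal-sized bins labeled 1 (lowest) to 4 (highest)
monetary = pd.Series([100, 200, 300, 400, 500, 600, 700, 800])
m_score = pd.qcut(monetary, 4, labels=[1, 2, 3, 4])
```

With eight evenly spread values, each quartile bin receives exactly two of them.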

______________________________________________

(10) Customer Segmentation
Classify customers based on their RFM score patterns

In [345]:
def classify_rfm(rfm_score):
    recency_map = {
        '4': "Loyal",
        '3': "Engaged",
        '2': "At Risk",
        '1': "Churned"
    }
    frequency_map = {
        '4': "Frequent",
        '3': "Occasional",
        '2': "Seldom",
        '1': "Rare"
    }
    monetary_map = {
        '4': "VIP",
        '3': "Premium",
        '2': "Average",
        '1': "Budget"
    }
    
    r_label = recency_map.get(rfm_score[0], "Unknown")
    f_label = frequency_map.get(rfm_score[1], "Unknown")
    m_label = monetary_map.get(rfm_score[2], "Unknown")
    
    return f"{r_label} {f_label} {m_label}"

# Apply to DataFrame
rfm['Segment'] = rfm['RFM_Score'].apply(classify_rfm)

rfm.head()
Out[345]:
Recency Frequency Monetary R_Score F_Score M_Score RFM_Score Segment
Customer ID
AA-10315 185 5 5563.56 1 1 4 114 Churned Rare VIP
AA-10375 20 9 1056.39 4 4 1 441 Loyal Frequent Budget
AA-10480 260 4 1790.51 1 1 2 112 Churned Rare Average
AA-10645 56 6 5086.94 3 2 4 324 Engaged Seldom VIP
AB-10015 416 3 886.15 1 1 1 111 Churned Rare Budget
In [ ]:
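A natural follow-up is to count how many customers fall into each segment; on the real table this would be `rfm['Segment'].value_counts()`. A minimal sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical segment labels standing in for rfm['Segment']
segments = pd.Series([
    "Churned Rare VIP",
    "Loyal Frequent Budget",
    "Churned Rare VIP",
])
segment_counts = segments.value_counts()  # segment sizes, largest first
```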